import pandas as pd
import numpy as np
import scipy.stats as st
import matplotlib.pyplot as plt
import seaborn as sns
from ipywidgets import *
%matplotlib inline

# import sys
# sys.path.append('modules')
# import NeuralNet as nn

import tensorflow as tf
# tf.logging.set_verbosity(tf.logging.ERROR)
from keras import optimizers
from keras import backend as K
from keras import models
from keras import layers
from keras import initializers
from tensorflow.keras.utils import to_categorical

def sinusoidal(x):
    return np.sin(np.pi * x)

def heaviside(x):
    return 0.5 * (np.sign(x) + 1)

Neural Networks

The linear model takes the general form \begin{align*} \mathbf{f}(\mathbf{x},\mathbf{w}) = f \left( \sum_{i=0}^{m} w_i \phi_i(\mathbf{x})\right) \end{align*}

where

  • $f(\cdot)$ is a nonlinear activation function (such as $\sigma$) in the case of classification,

  • $f(\cdot)$ is the identity in the case of regression.

Our goal is to extend this model by making the basis functions $\phi_i(\mathbf{x})$ depend on parameters, and then to allow these parameters to be adjusted, along with the coefficients $\{w_j\}$, during training.

This leads to the basic neural network model, which can be described as a series of functional transformations. Consider a two-layer network diagram:

  • We call it a two-layer network because two is the number of layers of adaptive weights.

  • The network has no closed directed cycles so that outputs are deterministic functions of the inputs.

The corresponding network function is

\begin{align*} y_k(\mathbf{x},\mathbf{w}) = f \overbrace{\left( \sum_{j=0}^{\ell_2} w^{(2)}_{kj} h\underbrace{\left( \sum_{i=0}^{\ell_1} w^{(1)}_{ji} x_i\right)}_{a_j}\right)}^{a_k} \end{align*}

The quantities $a_j$ and $a_k$ are called activations, and $h(\cdot)$ is the activation function, which is differentiable and nonlinear.

Due to symmetry, multiple distinct choices for the weight vector $\mathbf{w}$ can give the same mapping function.
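
To make the notation concrete, here is a small NumPy sketch (the weights and layer sizes are made up for illustration, and the bias terms $i=0$ are omitted for brevity) that evaluates the two-layer network function above for a batch of inputs and demonstrates the sign-flip symmetry of $\tanh$ hidden units.

import numpy as np

rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(5, 3))   # 5 input vectors with 3 components each
W1 = rng.normal(size=(3, 4))          # first layer of adaptive weights w^(1)_ji
W2 = rng.normal(size=(4, 2))          # second layer of adaptive weights w^(2)_kj

def network_fn(X, W1, W2, h=np.tanh, f=lambda a: a):
    A1 = X @ W1        # activations a_j
    Z = h(A1)          # hidden unit outputs z_j = h(a_j)
    A2 = Z @ W2        # activations a_k
    return f(A2)       # outputs y_k (identity f, as in regression)

y = network_fn(X, W1, W2)

# Weight-space symmetry: flipping the sign of all weights into and out of the
# tanh hidden units leaves the mapping unchanged, since tanh(-a) = -tanh(a).
print(np.allclose(y, network_fn(X, -W1, -W2)))   # True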

Keras

Keras is a deep-learning framework for Python that provides a convenient way to define and train almost any kind of deep-learning model.

Let's look at an example of applying a neural network to the MNIST dataset to classify handwritten digits.

There is a set of 60,000 training images, plus 10,000 test images, assembled by the National Institute of Standards and Technology (NIST). Each image is a grayscale 28 $\times$ 28 pixel handwritten digit. We're trying to classify the images into their 10 categories (0 through 9).

from keras.datasets import mnist
(train_images, train_labels), (test_images, test_labels) = mnist.load_data()
print('training images:{}, test images:{}'.format(train_images.shape, test_images.shape))

def showimg(data):
    # Show `span` consecutive images starting at index `idx`.
    idx  = 3
    span = 5
    if data == 'train':
        images = train_images
        labels = train_labels
    elif data == 'test':
        images = test_images
        labels = test_labels
    plt.figure(figsize=(10, 2))
    for i in range(span):
        plt.subplot(1, span, i + 1)
        digit = images[idx + i]
        plt.imshow(digit, cmap=plt.cm.binary)
        plt.title('Index:{}, Label:{}'.format(idx + i, labels[idx + i]), fontsize=12)
    plt.show()
    
showimg('train')
showimg('test')

Network Architecture

Tensors are the fundamental data representation for neural networks — so fundamental that Google’s TensorFlow was named after them.

  • Scalars: 0-dimensional tensors

  • Vectors: 1-dimensional tensors

  • Matrices: 2-dimensional tensors

Let's make data tensors more concrete with real-world examples:

  • Vector data — 2D tensors of shape (samples, features)

  • Timeseries data or sequence data — 3D tensors of shape (samples, timesteps, features)

  • Images — 4D tensors of shape (samples, height, width, channels) or (samples, channels, height, width)

  • Video — 5D tensors of shape (samples, frames, height, width, channels) or (samples, frames, channels, height, width)
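
As a quick sanity check, we can inspect the MNIST arrays loaded earlier. Note that MNIST stores grayscale images without an explicit channel axis, so the training images form a 3D tensor; adding a channels dimension of size 1 would recover the 4D image convention above.

print(train_images.ndim)          # 3
print(train_images.shape)         # (60000, 28, 28)
print(train_images.dtype)         # uint8

print(train_images[0].shape)      # a single image is a 2D tensor (matrix): (28, 28)
print(train_images[0, 0].shape)   # a single row of pixels is a 1D tensor (vector): (28,)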

The core building block of neural networks is the layer, a data-processing module that acts as a filter for data. Specifically, layers extract representations from the data fed into them in a more useful form, often called features.

Most of deep learning consists of chaining together simple layers that implement a form of progressive data distillation. A deep-learning model is like a sieve for data processing, made of a succession of increasingly refined data filters: the layers.

network = models.Sequential()
network.add(layers.Dense(512, activation='relu', input_shape=(28 * 28,)))
network.add(layers.Dense(10, activation='softmax'))

Here, our network consists of a sequence of two densely connected (fully connected) layers. The second (and last) layer is a 10-way softmax layer, which means it will return an array of 10 probability scores (summing to 1). Each score will be the probability that the current digit image belongs to one of our 10 digit classes.
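
As a rough check on the architecture, each Dense layer holds (inputs $\times$ units) weights plus one bias per unit, which gives 407,050 trainable parameters in total for the network above; this should agree with what Keras reports. A minimal sketch:

# Each Dense layer has (inputs x units) weights plus one bias per unit.
dense1_params = 28 * 28 * 512 + 512   # 401,920
dense2_params = 512 * 10 + 10         # 5,130
print(dense1_params + dense2_params)  # 407,050 trainable parameters in total

network.summary()   # reports the same counts layer by layer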

Compilation

Before training the network, we need to perform a compilation step by setting up:

  • An optimizer: the mechanism to improve its performance on the training data

  • A loss function: the measurement of its performance on the training data

  • Metrics to monitor during training and testing

network.compile(optimizer='rmsprop',
                loss='categorical_crossentropy',
                metrics=['accuracy'])

Data Preparation

Before training, we preprocess our data by reshaping and scaling it. We also need to categorically encode the labels, so that each label becomes a one-hot vector matching the 10-way softmax output of the network.

train_images_reshape = train_images.reshape((60000, 28 * 28))
train_images_reshape = train_images_reshape.astype('float32') / 255
test_images_reshape = test_images.reshape((10000, 28 * 28))
test_images_reshape = test_images_reshape.astype('float32') / 255

train_labels_cat = to_categorical(train_labels)
test_labels_cat = to_categorical(test_labels)

Training the Network

We train the network as follows

network.fit(train_images_reshape, train_labels_cat, epochs=5, batch_size=128, verbose=1);

The network will iterate over the training data in mini-batches of 128 samples, 5 times over (each pass over all the training data is called an epoch). For each batch, the network computes the gradient of the loss with respect to the weights and updates the weights accordingly. After these 5 epochs, the network will have performed 2345 = 5 $\times$ ceil(60000 $\div$ 128) gradient updates.

Batch size impacts learning significantly. If the batch size is large enough, it provides a stable estimate of what the gradient over the full dataset would be, while sampling from the dataset reduces the computational cost significantly.

The smaller the batch, the noisier the gradient estimate; in some cases these noisy gradients can actually help escape local minima. When the batch size is too small and the data are noisy, the weights may jump around and the network may fail to learn or converge very slowly, negatively impacting the total computation time.
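
The arithmetic behind the 2,345 figure above, and the trade-off just described, can be made explicit with a small calculation (the batch sizes other than 128 are shown only for comparison):

import math

epochs, n_train = 5, 60000
for batch_size in [1, 32, 128, 60000]:
    updates_per_epoch = math.ceil(n_train / batch_size)
    print('batch_size={:>5}: {:>5} updates per epoch, {:>6} updates in {} epochs'.format(
        batch_size, updates_per_epoch, epochs * updates_per_epoch, epochs))
# batch_size=128 gives ceil(60000/128) = 469 updates per epoch, i.e. 5 x 469 = 2345 in total.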

def train_MNIST():
    # Run this function to reproduce the training output.
    # PATH is assumed to point to the directory where the trained model is saved.
    model = models.Sequential()
    model.add(layers.Dense(512, activation='relu', input_shape=(28 * 28,)))
    model.add(layers.Dense(10, activation='softmax'))
    model.compile(optimizer='rmsprop', loss='categorical_crossentropy', metrics=['accuracy'])
    model.fit(train_images_reshape, train_labels_cat, epochs=5, batch_size=128, verbose=1)
    model.save(PATH+"\\minst.h5")

# train_MNIST()

model = models.load_model(PATH+"\\minst.h5")
test_loss, test_acc = model.evaluate(test_images_reshape, test_labels_cat)
print('Test accuracy is {}%'.format(round(test_acc*100,2)))

def misclassifiedimg():
    predicted = np.argmax(model.predict(test_images_reshape), axis=-1)
    misclassified = np.where(predicted != test_labels)[0]
    index = 1
    print('Total number of misclassified images is {}'.format(misclassified.shape[0]),
          'Examples of misclassified images {}-{}'.format(index, index+4))

    plt.figure(figsize=(13,3))
    for i in range(5):
        plt.subplot(1, 5, i + 1)
        idx = misclassified[i+index]
        digit = test_images[idx]
        plt.imshow(digit, cmap=plt.cm.binary)
        plt.title('Predicted:{}, Label:{}'.format(predicted[idx], test_labels[idx]), fontsize = 12)
    plt.show()
    
misclassifiedimg()
313/313 [==============================] - 1s 1ms/step - loss: 0.0753 - accuracy: 0.9774
Test accuracy is 97.74%
Total number of misclassified images is 226 Examples of misclassified images 1-5

Feedforward Neural Networks

Feedforward neural networks (deep feedforward networks, or multilayer perceptrons (MLPs)) are the quintessential deep learning models. We demonstrate the capability of a two-layer network to model a broad range of functions, such as $x^2$, $\sin(\pi x)$, $\text{abs}(x)$ and $\text{heaviside}(x)$.

The prediction results and the outputs of the 3 hidden units are shown in the graph below.

func_list = [np.square, sinusoidal, np.abs, heaviside]
def create_data(func, n=50):
    x = np.linspace(-1, 1, n)[:, None]
    return x, func(x)

def train_feedfowardNN():
    # Train one small network per target function and save its predictions.
    # x_test (defined in the next cell) and PATH are assumed to exist before this is run.
    for i, func in enumerate(func_list):
        x_train, y_train = create_data(func)
        model = models.Sequential()
        model.add(layers.Dense(3, activation='tanh', input_shape=(1,), name='mid_layer'))
        model.add(layers.Dense(1))
        model.compile(optimizer='Adam', loss='mean_squared_error', metrics=['mse'])
        model.fit(x_train, y_train, epochs=10000, batch_size=1, verbose=0)
        y = model.predict(x_test)
        # Outputs of the hidden layer ('mid_layer') on the same test inputs.
        intermediate_output = models.Model(inputs=model.input,
                                           outputs=model.get_layer('mid_layer').output).predict(x_test)
        df = pd.DataFrame(data=np.concatenate((y, intermediate_output), axis=1),
                          columns=['y', 'unit1', 'unit2', 'unit3'])
        df.to_csv(PATH+'\\results_{}.csv'.format(i), header=True, index=False)
        K.clear_session()
        del model
        
# train_feedfowardNN()        
        
x_test = np.linspace(-1, 1, 1000)[:, None]  # column vector, so model.predict receives shape (1000, 1)
plt.figure(figsize=(12, 9))
for i, func in enumerate(func_list):
    plt.subplot(2, 2, i+1)
    x_train, y_train = create_data(func)
    df = pd.read_csv(PATH+'\\results_{}.csv'.format(i))
    plt.title("func = {}".format(func.__name__))
    plt.scatter(x_train, y_train, s=5, label='func')
    plt.plot(x_test, df.y, color="r", label='fitting')
    for j in range(3):
        plt.plot(x_test, df['unit'+str(j+1)], linestyle='dashed', label=r"hidden unit-{}".format(j+1))
plt.legend(bbox_to_anchor=(1.02, 0.55), loc=2, borderaxespad=0.5)
plt.show()        

Network Training

We minimize the error function (generally nonconvex) \begin{align*} E(\mathbf{w}) = \frac{1}{2} \sum_{i=1}^{n} \lVert \mathbf{f}(\mathbf{x}_i, \mathbf{w}) - \mathbf{y}_i \rVert^2 = \sum_{i=1}^{n} E_i(\mathbf{w}) \end{align*}

Error Backpropagation is applied to derive $\partial E_i \big/ \partial w_{ji}$ in the following steps (a NumPy sketch follows the list):

  • Apply an input vector $\mathbf{x}_n$ to the network and forward propagate through the network to find the activations of all the hidden and output units where
\begin{align*} z_j = h(a_j), \text{ and } a_j = \sum_i w_{ji} z_i \end{align*}

In this recursion, $z_i = x_i$ for the input units and $z_k = y_k$ for the output units.

  • Evaluate $\delta_k = y_k - t_k$ for all the output units where
\begin{align*} \delta_j \equiv \frac{\partial E_i}{\partial a_j} \end{align*}
  • Obtain $\delta_j$ for each hidden unit in the network by backpropagation: \begin{align*} \delta_j \equiv \frac{\partial E_i}{\partial a_j} = \sum_{k} \frac{\partial E_i}{\partial a_k} \frac{\partial a_k}{\partial a_j} = h^\prime(a_j) \sum_{k} w_{kj} \delta_k \end{align*}

Note that the recursion starts from the output and goes through the network backward.

  • Evaluate the required derivatives
\begin{align*} \frac{\partial E_i}{\partial w_{ji}} = \frac{\partial E_i}{\partial a_{j}} \frac{\partial a_j}{\partial w_{ji}} = \delta_j z_i \end{align*}
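
The four steps can be spelled out in a few lines of NumPy for a single training example. The sketch below uses a made-up two-layer tanh network with identity outputs (the sizes and variable names W1, W2 are illustrative, not from the text) and verifies the resulting derivative $\partial E_i / \partial w_{ji}$ against a finite-difference approximation.

import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(size=4)               # one input vector (z_i = x_i for the input)
t = rng.normal(size=2)               # target for this example
W1 = rng.normal(size=(4, 3)) * 0.5   # first-layer weights w_ji
W2 = rng.normal(size=(3, 2)) * 0.5   # second-layer weights w_kj

def forward(x, W1, W2):
    a_hidden = W1.T @ x              # a_j = sum_i w_ji z_i
    z = np.tanh(a_hidden)            # z_j = h(a_j)
    a_out = W2.T @ z                 # a_k (identity output, as in regression)
    return a_hidden, z, a_out

# 1) forward propagate
a_hidden, z, y = forward(x, W1, W2)
# 2) output-unit errors delta_k = y_k - t_k
delta_out = y - t
# 3) backpropagate: delta_j = h'(a_j) * sum_k w_kj delta_k, with h'(a_j) = 1 - z_j^2
delta_hidden = (1 - z ** 2) * (W2 @ delta_out)
# 4) required derivatives dE_i/dw_ji = delta_j z_i (outer products)
grad_W1 = np.outer(x, delta_hidden)
grad_W2 = np.outer(z, delta_out)

# Finite-difference check of one first-layer derivative.
def error(W1, W2):
    return 0.5 * np.sum((forward(x, W1, W2)[2] - t) ** 2)

eps = 1e-6
W1_shift = W1.copy()
W1_shift[0, 0] += eps
print(grad_W1[0, 0], (error(W1_shift, W2) - error(W1, W2)) / eps)  # should agree closely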

The technique of backpropagation can also be applied to the calculation of Jacobian and Hessian matrix where

\begin{align*} J_{ki} \equiv \frac{\partial y_k}{\partial x_i}, \text{ and } \mathbf{H} \sim \frac{\partial^2 E}{\partial w_{ji} \partial w_{lk}} \end{align*}
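
Automatic differentiation frameworks expose these quantities directly. For example, with TensorFlow the Jacobian $\partial y_k / \partial x_i$ of the MNIST model can be obtained via tf.GradientTape (a sketch assuming the trained model and reshaped test images from the earlier cells; the Hessian could be obtained similarly with nested tapes):

# Jacobian of the 10 output scores with respect to one input image.
x = tf.convert_to_tensor(test_images_reshape[:1])   # one input of shape (1, 784)
with tf.GradientTape() as tape:
    tape.watch(x)
    y = model(x)                                    # outputs of shape (1, 10)
jacobian = tape.jacobian(y, x)
print(jacobian.shape)                               # (1, 10, 1, 784)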

We consider a simple task: learning the XOR function to demonstrate the training process.

The XOR function ("exclusive or") is an operation on two binary values such that

\begin{align*} f(x_1, x_2) = \left\{ \begin{aligned} &0 && \text{ if } x_1 = x_2 \\ &1 && \text{ otherwise.} \end{aligned} \right. \end{align*}
from sklearn.linear_model import LinearRegression

x_train = np.array([[0, 0],[0, 1],[1, 0],[1, 1]])
t_train = np.array([0, 1, 1, 0])
x0, x1 = np.meshgrid(np.linspace(0, 1, 100), np.linspace(0, 1, 100))
x_test = np.array([x0, x1]).reshape(2, -1).T
t = LinearRegression().fit(x_train, t_train).predict(x_test)

plt.figure(figsize=(4, 3))
plt.scatter(x_train[:, 0], x_train[:, 1], c=t_train)
levels = np.linspace(0, 1, 5)
plt.contourf(x0, x1, np.asarray(t).reshape(100, 100), levels, alpha=0.2)
plt.colorbar()
plt.show()

The dots indicate the target values of the inputs. It is clear that a linear function cannot describe the mapping, because any hyperplane will have both target values 0 and 1 on one side. This is demonstrated by the linear regression model above.

We provide details on learning the XOR function by using a simple neural network.

We have

  • input
\begin{align} \mathbf{x} = \left( \begin{array}{l} 0 \\ 0 \\ 1 \\ \end{array} \right) \text{ or } \left( \begin{array}{l} 0 \\ 1 \\ 1 \\ \end{array} \right) \text{ or } \left( \begin{array}{l} 1 \\ 0 \\ 1 \\ \end{array} \right) \text{ or } \left( \begin{array}{l} 1 \\ 1 \\ 1 \\ \end{array} \right) \nonumber \end{align}
  • hidden layer
\begin{align} a^1_{j} = {\mathbf{w}^1_j}^T \mathbf{x} \text{, } \mathbf{a}^1 = (a^1_{j} :\forall j)^T \text{ and } z_{j} = h(a^1_{j}) \text{, } \mathbf{z} = (z_{j}: \forall j)^T \nonumber \end{align}
  • output
\begin{align} a^2_{k} = {\mathbf{w}^2_{k}}^T \mathbf{z} \text{ and } y = y_k = h(a^2_{k}) \nonumber \end{align}

Note that, in this case, $k \in \{1\}$, i.e., there is a single output unit.

Feed-Forward

To apply stochastic gradient descent, we consider one instance and its error function

\begin{align} E_i = \frac{1}{2} \left( y - t \right)^2 \nonumber \end{align}

with the forward pass

\begin{align} \mathbf{z} = h\left({\mathbf{W}^1}^T \mathbf{x} \right) \text{ and } y = h\left({\mathbf{w}^2}^T \mathbf{z}\right) \nonumber \end{align}

where

\begin{align} \mathbf{W}^{1} = (\mathbf{w}^1_{j} : \forall j) \nonumber \end{align}

Derivatives

Let $h = \tanh$.

\begin{align} \sigma^\prime(x) &= \sigma(x) (1- \sigma(x)) \nonumber\\ \tanh^\prime(x) &= 1 -\tanh^2(x) \nonumber \end{align}

Error Backpropagation

\begin{align} \delta^2_{k} &\equiv \frac{\partial E_i}{\partial a^2_k} = \frac{\partial E_i}{\partial y} \frac{\partial y}{\partial a^2_k} = (y-t) h^\prime(a^2_k) = (y-t) (1- h^2(a^2_k)) = (y-t) (1- y^2) \nonumber \\ \delta^1_{j} &\equiv \frac{\partial E_i}{\partial a^1_{j}} = \sum_k \frac{\partial E_i}{\partial a^2_{k}} \frac{\partial a^2_{k}}{\partial z_j} \frac{\partial z_j}{\partial a^1_{j}} = h^\prime(a^1_{j}) \sum_{k} \delta^2_{k} w^2_{k,j} = (1-z_j^2) \sum_{k} \delta^2_{k} w^2_{k,j} \nonumber \end{align}\begin{align} \frac{\partial E_i}{\partial \mathbf{w}^2_k} & = \frac{\partial E_i}{\partial a^2_k} \frac{\partial a^2_k}{\partial \mathbf{w}^2_k} = \delta^2_{k} \mathbf{z} \nonumber \\ \frac{\partial E_i}{\partial \mathbf{w}^1_j} & = \frac{\partial E_i}{\partial a^1_j} \frac{\partial a^1_j}{\partial \mathbf{w}^1_j} = \delta^1_{j} \mathbf{x} \nonumber \end{align}

The code below implements the neural network and the backpropagation algorithm. The plot shows that the XOR function is correctly learned by our neural network.

def sigmoid(x):
    return 1.0/(1.0 + np.exp(-x))

def sigmoid_prime(x):
    # Note: the argument is the activation value sigmoid(a), not a itself.
    return x*(1.0-x)

def tanh(x):
    return np.tanh(x)

def tanh_prime(x):
    # Note: the argument is the activation value tanh(a), not a itself.
    return 1.0 - x**2

class NeuralNetwork:
    def __init__(self, layers, activation='tanh'):
        if activation == 'sigmoid':
            self.activation = sigmoid
            self.activation_prime = sigmoid_prime
        elif activation == 'tanh':
            self.activation = tanh
            self.activation_prime = tanh_prime
        
        # Set weights
        self.weights = []
        # layers = [2,2,1]
        # range of weight values (-1,1)
        # input and hidden layers - random((2+1, 2+1)) : 3 x 3
        for i in range(1, len(layers) - 1):
            r = 2*np.random.random((layers[i-1] + 1, layers[i] + 1)) -1
            self.weights.append(r)
        # output layer - random((2+1, 1)) : 3 x 1
        r = 2*np.random.random( (layers[i] + 1, layers[i+1])) - 1
        self.weights.append(r)

    def fit(self, X, t, learning_rate=0.2, epochs=100000):
        ones = np.atleast_2d(np.ones(X.shape[0]))
        X = np.concatenate((ones.T, X), axis=1)
        # Add the bias unit to the input layer X, so that
        # X = [[1. 0. 0.]
        #      [1. 0. 1.]
        #      [1. 1. 0.]
        #      [1. 1. 1.]]
         
        for k in range(epochs):
            i = np.random.randint(X.shape[0])
            z = [X[i]]
            
            for l in range(len(self.weights)):
                z.append(self.activation(np.array(self.weights[l]).T @ np.array(z[l])))

            # backpropagation
            # output layer y = z[-1]
            error = z[-1] - t[i]
            deltas = [error * self.activation_prime(z[-1])]
            for l in range(len(self.weights) - 1, 0, -1): 
                deltas.append(np.dot(self.weights[l], deltas[-1]) * self.activation_prime(z[l]))
            deltas.reverse()

            for j in range(len(self.weights)):
                # Gradient for layer j is the outer product z_j delta_j (cf. dE_i/dw = delta * z).
                self.weights[j] -= learning_rate * np.array(z[j])[:,None].dot(np.array(deltas[j])[:,None].T)

            if k % 10000 == 0: print('epochs:{}, error={}'.format(k, error))

    def predict(self, X): 
        ones = np.atleast_2d(np.ones(X.shape[0]))
        y = np.concatenate((ones.T, X), axis=1)
        for l in range(0, len(self.weights)):
            y = self.activation(np.dot(y, self.weights[l]))
        return y
    
nn = NeuralNetwork([2,2,1], activation='tanh')
nn.fit(x_train, t_train)
np.set_printoptions(suppress=True)
# print(np.append(x_train, nn.predict(x_train), axis=1))
plt.figure(figsize=(4, 3))
plt.scatter(x_train[:, 0], x_train[:, 1], c=t_train)
x0, x1 = np.meshgrid(np.linspace(0, 1, 100), np.linspace(0, 1, 100))
x_test = np.array([x0, x1]).reshape(2, -1).T
levels = np.linspace(0, 1, 5)
plt.contourf(x0, x1, np.asarray(nn.predict(x_test)).reshape(100, 100), levels, alpha=0.2)
plt.colorbar()
plt.show()    
epochs:0, error=[-0.05189828]
epochs:10000, error=[-0.00155147]
epochs:20000, error=[-0.00846595]
epochs:30000, error=[-0.00634869]
epochs:40000, error=[0.00145377]
epochs:50000, error=[-0.00012334]
epochs:60000, error=[-0.00458919]
epochs:70000, error=[-0.00002698]
epochs:80000, error=[-0.00367707]
epochs:90000, error=[0.00637885]

Regression

We are attempting to predict the median price of homes in a given Boston suburb in the mid-1970s, given a few data points about the suburb at the time, such as the crime rate, the local property tax rate, etc.

The dataset has very few data points, only 506 in total, split between 404 training samples and 102 test samples, and each "feature" in the input data (e.g. the crime rate is a feature) has a different scale. For instance, some values are proportions taking values between 0 and 1, others take values between 1 and 12, and others between 0 and 100.

from keras.datasets import boston_housing
(train_data, train_targets), (test_data, test_targets) =  boston_housing.load_data()

train_data.shape, test_data.shape, train_targets[:10]

The data comprises 13 features as follows:

  • Per capita crime rate.

  • Proportion of residential land zoned for lots over 25,000 square feet.

  • Proportion of non-retail business acres per town.

  • Charles River dummy variable (= 1 if tract bounds river; 0 otherwise).

  • Nitric oxides concentration (parts per 10 million).

  • Average number of rooms per dwelling.

  • Proportion of owner-occupied units built prior to 1940.

  • Weighted distances to five Boston employment centres.

  • Index of accessibility to radial highways.

  • Full-value property-tax rate per $10,000.

  • Pupil-teacher ratio by town.

  • $1000 \times (B_k - 0.63)^2$ where $B_k$ is the proportion of Black people by town.

  • Percentage of lower status of the population.

The targets are the median values of owner-occupied homes, in thousands of dollars. The prices are typically between $10,000 and $50,000.
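
Because the features live on very different scales, a common preparation step is feature-wise standardization (not performed above; a minimal sketch, assuming we normalize with statistics computed on the training set only):

# Feature-wise standardization: subtract the training mean and divide by the
# training standard deviation. The test data must use the *training* statistics.
mean = train_data.mean(axis=0)
std = train_data.std(axis=0)

train_data_norm = (train_data - mean) / std
test_data_norm = (test_data - mean) / std

print(train_data_norm.mean(axis=0).round(2))   # ~0 for every feature
print(train_data_norm.std(axis=0).round(2))    # ~1 for every feature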